1. INTRODUCTION

Coronavirus (CoV) is a disease-causing virus that aggravates existing illness in the human body,
which makes it very dangerous [1], [2]. The virus causes headaches, breathing difficulty,
indigestion, and liver disorders, and it harms animals kept, raised, and used by people, such as
cows, horses, and pigs, as well as various wild animals [3]. The severe acute respiratory syndrome
(SARS) epidemic of 2002-2003 and the outbreak of Middle East respiratory syndrome (MERS) showed
that a newly emerged coronavirus can pass from animals to humans and vice versa [4]. Although such
cases are rare, they do exist. In late December 2019, a pneumonia of unknown cause drew worldwide
attention [5]. On January 30, 2020, India announced its first case of coronavirus disease
(COVID-19) [6]. By 6 June 2020, 247,857 cases had been reported in India, of which 119,293 had
recovered and 6,954 had died. After that date, about 10,000 new cases continued to come to light
every day. These figures are published on the website [7].

The first COVID-19 case was reported at the Huanan Market in Wuhan, China [8]. Animal-to-human
transmission was initially regarded as the main route of spread. However, later COVID-19 cases
were not linked to this mode of exposure. It was therefore concluded that the virus spreads from
human to human and that symptomatic people are the most frequent source of COVID-19 spread.
Transmission before symptoms develop appears to be rare, although it cannot be ruled out. People
are also advised that both symptomatic and asymptomatic individuals may pass on the virus, and
social distancing is recommended as the way to stay safe. As with rhinovirus, influenza, and other
respiratory pathogens, the droplets from a person's sneeze and cough are believed to be the main
vehicle of spread [9].

There is currently no specific, proven antiviral therapy for COVID-19. The combination of
interferon (IFN) and ribavirin has only a very small effect on COVID-19. Several efforts have been
made to develop new antivirals that target the CoV protease, polymerase, and entry proteins, but
only a few of them have proven worthwhile in clinical trials. Plasma and antibodies donated by
patients who have recovered from COVID-19 have shown benefit in treatment. In addition, various
vaccine strategies have been developed, including inactivated virus, live attenuated virus, viral
vector, subunit, recombinant protein, and DNA vaccines [10]. Until now no effective COVID-19
vaccine or therapy has been available, so the best measures are to control the source of
infection, diagnose early, report, isolate, provide supportive therapy, and publish information in
time to avoid unnecessary anxiety [11]. Everyone can help block the spread of the COVID-19 virus
through good personal hygiene, a well-fitted mask, ventilation, and keeping away from crowded
areas [12]. The major contributions of this paper are:

— A data-driven predictive analysis of COVID-19 among different states in India. The analysis is done
after pre-processing the data, which includes handling missing values and removing redundant data.

— A novel ensemble predictive model built from linear regression, polynomial regression, and
support vector machine (SVM) regression models. The model predicts the number of confirmed
cases in India from 30 May 2021 to 15 June 2021 using the data available from 22 January 2020 to
29 May 2021.

The rest of this paper is organized as follows. Section 2 presents literature published in the
area of analysis and prediction of COVID-19. Section 3 describes the detailed methodology,
including data collection, data pre-processing and feature reduction, data visualization, and all
the regression models together with the proposed novel ensemble regression model. Section 4
presents the results and discussion of the proposed model, followed by a conclusion in Section 5.


2. RELATED WORK 

Researchers, scientists, and medical professionals are conducting a multitude of studies on
COVID-19 to develop various types of models for its prediction. Some of these publications are
discussed in this section. Yadav et al. [13] focused on predicting the transmission of the
COVID-19 virus and the scenario of its spread. They proposed a novel support vector regression
model, instead of a simple regression line, to obtain better accuracy, and compared the results
with simple linear regression and polynomial regression models. The model predicts the spread of
coronavirus and analyses the growth rate, transmission rate, number of recoveries, and the
correlation between coronavirus and weather. Khanday et al. [14] applied various classical and
ensemble machine learning (ML) techniques for classification after a feature engineering process,
to better understand the viral spread of COVID-19, and measured performance in terms of accuracy,
precision, recall, and F1 score. Rahimi et al. [15] designed mathematical models for COVID-19
based on susceptible, infected, and recovered (SIR) cases and susceptible, exposed, infected,
quarantined, and recovered (SEIQR) cases with specific parameter settings and optimization
algorithms. The optimized SIR and SEIQR models were also compared with ML models, and the results
show that the optimized SIR and SEIQR models perform better. Tiwari et al. [16] predicted the
number of confirmed, recovered, and death cases of COVID-19 in India using a machine learning
approach based on the pattern observed in China. They concluded that the growth rate would be
higher in the 3rd and 4th weeks of April 2020 and would be under control by the end of May 2020.
Tomar et al. [17] proposed a data-driven model using the long short-term memory (LSTM) technique
and curve fitting to forecast 90 days of confirmed and recovered cases of COVID-19 in India. They
also analysed the impact of social distancing and lockdown on the spread of the virus. Wang et al.
[18] proposed a hybrid predictive model based on the logistic model and the FbProphet model and
concluded that the infection rate would peak by late October 2020. The logistic model is used to
fit the cap of the epidemic trend, which is then fed to the FbProphet model to derive the curve
and trend of the epidemic globally. Tuli et al. [19] proposed an improved mathematical model using
machine learning and cloud computing to analyse and predict the growth of the epidemic globally.
They took 5 sets of a global dataset of daily confirmed cases of COVID-19 to find the best-fitting
distribution model. They identified the 5 best distributions, among which the Robust Weibull
distribution, fitted with an iteratively weighted approach, performed better than the others. They
also identified future research directions and emerging trends. From these studies, it can be
concluded that the COVID-19 virus is quite similar to the SARS and MERS viruses, and that its
infection rate is higher than its fatality rate. Moreover, researchers are continuously working to
build models for prediction and forecasting of the coronavirus, and most of these models are built
using machine learning.


3. METHODOLOGY

This section describes the dataset and the methodologies used for prediction. The flowchart of the
adopted methodology, which covers the analysis and prediction of confirmed COVID-19 cases in
India, is shown in Figure 1. The COVID-19 dataset of India is collected in the first step. The
data is then pre-processed and its features are reduced, followed by data visualization. The
existing regression models, namely linear regression [20], [21], polynomial regression [22], [23],
and SVM regression [24], [25], are implemented to predict the confirmed cases over the next 17
days. Finally, an ensemble regression model for prediction is proposed, which outperforms the
existing regression models.

[Figure 1 stages: Data Pre-processing & Feature Reduction; Data Visualization; Time-series
Forecasting; Linear Regression; Polynomial Regression; SVM Regression; Proposed Ensemble Technique]

Figure 1. Flowchart of the methodology


3.1. Data collection 

The dataset used in this study is collected from the Johns Hopkins University Center for Systems
Science and Engineering (JHU CSSE). The COVID-19 dataset for India is downloaded in
comma-separated values (CSV) format with 14,654 samples. The data covers 22 January 2020 to
29 May 2021 and consists of date, time, state/union territory, ConfirmedIndianNational,
ConfirmedForeignNational, cured, deaths, and confirmed as input variables, which are described in
Table 1. It is also observed that the number of reported new cases increases with time.


Table 1. Description of the features of the COVID-19 dataset in India 


Feature Name                Description of the Feature
Date                        Observation date on which the COVID-19 positive cases were reported in India
Time                        Time on the observation date at which the COVID-19 positive cases were reported
State/Union Territory       Name of the state or union territory in India where the COVID-19 cases were reported
ConfirmedIndianNational     Number of COVID-19 cases that originated in India
ConfirmedForeignNational    Number of COVID-19 cases found in India that originated in foreign countries
Cured                       Total number of recovered COVID-19 cases in India up to the observation date
Deaths                      Total number of COVID-19 deaths in India up to the observation date
Confirmed                   Total number of confirmed COVID-19 cases in India up to the observation date


3.2. Data pre-processing and feature reduction 

In this study, the dataset is analysed in a Jupyter Notebook with Python 3 by importing the
corresponding libraries. Data pre-processing consists of converting the observation date to
date-time format, and feature reduction consists of dropping two features, the
ConfirmedIndianNational and ConfirmedForeignNational columns. These two features are dropped
because they affected the dataset only at the beginning of the virus spread. Later, travel was
banned and the virus spread through community transmission. Further analysis is done state-wise on
the dataset of India, which is one of the most populated countries in the world.
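
The pre-processing steps described above can be sketched with pandas as follows. This is a minimal
illustration, not the authors' notebook: the four sample rows are invented, the `Date` and
`State/UnionTerritory` column names and the date format are assumptions about the CSV layout, and
the names `datewise_india` and `DaysSince` are borrowed from Algorithm-1.

```python
import pandas as pd

# Invented sample rows that mimic the features listed in Table 1.
raw = pd.DataFrame({
    "Date": ["2020-03-02", "2020-03-02", "2020-03-03", "2020-03-03"],
    "Time": ["6:00 PM"] * 4,
    "State/UnionTerritory": ["Kerala", "Delhi", "Kerala", "Delhi"],
    "ConfirmedIndianNational": [3, 1, 3, 2],
    "ConfirmedForeignNational": [0, 0, 0, 0],
    "Cured": [0, 0, 3, 0],
    "Deaths": [0, 0, 0, 0],
    "Confirmed": [3, 1, 3, 2],
})

# Pre-processing: convert the observation date to date-time format.
raw["Date"] = pd.to_datetime(raw["Date"], format="%Y-%m-%d")

# Feature reduction: drop the two nationality columns.
reduced = raw.drop(columns=["ConfirmedIndianNational", "ConfirmedForeignNational"])

# Aggregate the state-level rows into one country-level row per date.
datewise_india = reduced.groupby("Date")[["Confirmed", "Cured", "Deaths"]].sum()

# Numeric time axis for the regression models: days since the first observation.
datewise_india["DaysSince"] = (datewise_india.index - datewise_india.index[0]).days

print(datewise_india)
```

The `DaysSince` column gives the regression models a plain numeric input in place of calendar dates.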

3.3. Data visualization

Figure 2 shows the growth of COVID-19 across the country; the reported cases grow exponentially.
The spread of the virus was very limited in the first two months of the outbreak but became very
high later. Figures 3 to 5 show the total number of confirmed cases, recovered cases, and deaths,
respectively, for the different states and union territories in India. Maharashtra, Karnataka, and
Kerala are the three most affected states in India. The objective of the study is to understand
and visualize the outbreak of COVID-19 disease. Figure 6 gives detailed information on confirmed
cases, cured cases, deaths, active cases, death rate, and cure rate among the different states and
union territories in India, sorted by the number of confirmed cases. The higher the number of
cases, the more intense the colour.
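
The derived quantities reported for each state can be sketched as below. The formulas are
assumptions rather than definitions given in the paper: active cases are taken as confirmed minus
cured minus deaths, and the rates are expressed per 100 confirmed cases. The two regions and their
counts are invented for illustration.

```python
def summarize_state(confirmed, cured, deaths):
    """Derive active cases and per-100 rates from cumulative counts (assumed formulas)."""
    active = confirmed - cured - deaths
    return {
        "Confirmed": confirmed,
        "Cured": cured,
        "Deaths": deaths,
        "Active": active,
        "Death Rate (per 100)": round(100.0 * deaths / confirmed, 2),
        "Cure Rate (per 100)": round(100.0 * cured / confirmed, 2),
    }


# Invented cumulative counts for two hypothetical regions (not taken from the dataset).
latest_counts = {
    "Region A": {"confirmed": 1000, "cured": 900, "deaths": 20},
    "Region B": {"confirmed": 4000, "cured": 3000, "deaths": 40},
}

summary = {name: summarize_state(**counts) for name, counts in latest_counts.items()}

# Sort regions by confirmed cases, largest first, as in the state-wise figure.
ranked = sorted(summary, key=lambda name: summary[name]["Confirmed"], reverse=True)

for name in ranked:
    print(name, summary[name])
```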

Figure 2. Spread of COVID-19 across the country

Figure 4. Recovered cases of COVID-19

State/UnionTerritory    Confirmed    Deaths    Cured      Active
Uttar Pradesh           1425916      14501     1151571    259844
Tamil Nadu              1297600      14974     1151058    131468
Delhi                   1273035      18398     1164008    90629
Andhra Pradesh          1228186      8446      1037411    182329
West Bengal             935066       11964     800328     122774
Chhattisgarh            816489       9950      675294     131245
Rajasthan               702568       5182      499376     198010
Gujarat                 645972       8035      490412     147525
Madhya Pradesh          637406       6160      542632     88614
Haryana                 573815       5137      452836     115842
Bihar                   553803       3077      435574     115152
Odisha                  500162       2121      423257     74784
Telangana               481640       2625      405164     73851
Telengana               443360       2312      362160     78888
Punjab                  416350       9979      339803     66568
Assam                   277687       1531      242980     33176
Jammu and Kashmir       201511       2562      157283     41666

[The Death Rate (per 100) and Cure Rate (per 100) columns and the values for Maharashtra,
Karnataka, and Kerala are not legible in the extracted figure.]

Figure 6. State/union territory-wide analysis of COVID-19 in India


3.4. Proposed ensemble model

Ensemble learning is the process of integrating several individual models to improve accuracy and
performance. Algorithm-1 explains the working principle of the proposed ensemble model. The
proposed ensemble model is built on the data available from 22 January 2020 to 29 May 2021. It is
assumed that a person infected with coronavirus can transmit the virus to another person directly
or indirectly, since it is a communicable disease, and consequently the number of reported cases
grows rapidly. Hence, we develop an ensemble machine learning model based on linear regression,
polynomial regression, and SVM regression. The data is split into 80% for training and 20% for
testing. While developing the model, the dataset was analysed using the functions available in
Python. Table 2 shows the methods, packages, and parameters used by these models in Python for the
prediction of confirmed cases of COVID-19 in India. The linear regression, polynomial regression,
and SVM regression models are used to forecast the confirmed cases in India over the next 17 days.
Finally, an ensemble model is proposed that takes a weighted average of the predictions: it
multiplies the prediction from each statistical model by its weight and sums the results to obtain
the final predicted values.


Table 2. Description of machine learning models 


Model                    Method                  Required Package      Tuning Parameter
Linear Regression        LinearRegression()      LinearRegression      normalize=True
Polynomial Regression    PolynomialFeatures()    PolynomialFeatures    degree=8
SVM Regression           SVR()                   SVR                   C=1, degree=6, kernel='poly', epsilon=0.01
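
The fit-and-forecast workflow described in this subsection can be sketched with NumPy alone. This
is a simplified stand-in, not the authors' implementation: the series is synthetic, the 80/20
split is assumed to be chronological (the paper states only the proportions), a degree-2
polynomial replaces the degree-8 one from Table 2 so that the toy fit stays well conditioned, and
the SVM regressor is omitted.

```python
from datetime import date, timedelta

import numpy as np

# Synthetic cumulative series (not the paper's data): 60 days of quadratic growth.
days_since = np.arange(60, dtype=float)
confirmed = 1000.0 + 50.0 * days_since + 2.0 * days_since**2

# Chronological 80/20 split into training and testing data (assumed ordering).
split = int(0.8 * len(days_since))
x_train, y_train = days_since[:split], confirmed[:split]
x_test, y_test = days_since[split:], confirmed[split:]

# Least-squares fits: degree 1 is linear regression, degree 2 is polynomial regression.
linear_coeffs = np.polyfit(x_train, y_train, deg=1)
poly_coeffs = np.polyfit(x_train, y_train, deg=2)


def rmse(forecast, observed):
    """Root mean square error between forecasts and observations."""
    return float(np.sqrt(np.mean((forecast - observed) ** 2)))


rmse_linear = rmse(np.polyval(linear_coeffs, x_test), y_test)
rmse_poly = rmse(np.polyval(poly_coeffs, x_test), y_test)

# Forecast the 17 days after the last observation, mirroring range(1, 18) in Algorithm-1.
last_date = date(2021, 5, 29)
future_days = days_since.max() + np.arange(1, 18)
forecast_dates = [last_date + timedelta(days=i) for i in range(1, 18)]
linear_forecast = np.polyval(linear_coeffs, future_days)
poly_forecast = np.polyval(poly_coeffs, future_days)

print(forecast_dates[0], forecast_dates[-1])
print(rmse_linear, rmse_poly)
```

On this curved toy series the straight line lags behind the data on the held-out days, which is
the same qualitative behaviour the linear model shows in Table 4.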


Algorithm-1 for proposed ensemble model
Start
Load the dataset from the CSV file. The data covers 22 January 2020 to 29 May 2021.
# Forecast the next 17 days with each fitted model
for i in range(1, 18):
    next_day = np.array(datewise_india["DaysSince"].max() + i).reshape(-1, 1)
    new_date.append(datewise_india.index[-1] + timedelta(days=i))
    new_prediction_lr.append(lin_reg.predict(next_day)[0][0])
    new_prediction_poly.append(poly.predict(next_day)[0][0])
    new_prediction_svm.append(svm.predict(next_day)[0][0])
The forecast is obtained for 30 May 2021 to 15 June 2021.
The results are collected in
model_predictions = pd.DataFrame(
    zip(new_date, new_prediction_lr, new_prediction_poly, new_prediction_svm),
    columns=["Dates", "Linear Regression Prediction", "Polynomial Regression Prediction",
             "SVM Regression Prediction"])
Proposed ensemble model predictions
final_pred = []
for i in range(0, 17):
    # Weighted combination of the three model forecasts
    x = np.sum(new_prediction_lr[i]*0.5 + new_prediction_poly[i]*0.07 + new_prediction_svm[i]*0.6)
    final_pred.append(x)
End
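
The final weighting step of Algorithm-1 can be checked against the published forecasts. The sketch
below applies the weights 0.5, 0.07, and 0.6 to the first two rows of Table 4; the function and
variable names are illustrative and do not come from the paper. Note that the weights sum to 1.17
rather than 1, so the combination is a weighted sum rather than a normalized average.

```python
# Weights as written in Algorithm-1 (they sum to 1.17, not 1).
WEIGHTS = {"linear": 0.5, "polynomial": 0.07, "svm": 0.6}


def ensemble_forecast(linear, polynomial, svm, weights=WEIGHTS):
    """Weighted combination of the three model forecasts for one day."""
    return (weights["linear"] * linear
            + weights["polynomial"] * polynomial
            + weights["svm"] * svm)


# Model forecasts for 30-05-2021 and 31-05-2021, copied from Table 4.
table4_rows = [
    {"linear": 15375267, "polynomial": 40035027, "svm": 27103831},
    {"linear": 15413275, "polynomial": 41177458, "svm": 27416098},
]

final_pred = [round(ensemble_forecast(**row)) for row in table4_rows]
print(final_pred)
```

Rounded to whole cases, the two results are 26752384 and 27038718, which agree with the Ensemble
Technique column of Table 4.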


4. RESULT ANALYSIS AND COMPARISON OF ACTUAL AND PREDICTED CASES 

The pandemic has had a significant impact on world health and the economy. In this study, a total
of 14,653 samples of the COVID-19 dataset of India are taken into consideration. The forecast
models are built using linear regression, polynomial regression, and SVM regression. The graphical
representations of the models are shown in Figures 7 to 9, respectively. The training data for
confirmed cases is represented by the gray line, while the best-fit line of the corresponding
model is represented by the black dotted line. The root mean square error (RMSE), defined in (1),
is evaluated for each model after the prediction of confirmed cases, recovered cases, and deaths,
and is presented in Table 3. RMSE is the prediction error that tells us how concentrated the data
is around the best-fit line. It is commonly used in forecasting and regression analysis to verify
experimental results. The lower the RMSE, the better the model. In this study, the SVM regression
model gives the lowest RMSE for the confirmed cases and is therefore the best of the three models
for that quantity. For the number of recovered cases and deaths, however, the polynomial
regression model gives slightly better results than the SVM regression model.


RMSE = \sqrt{\overline{(F - O)^2}}    (1)

where F denotes the forecasts or expected values, O denotes the observed values, and the overbar
denotes the mean over all observations.
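
Equation (1) translates directly into a few lines of Python. The function below is a minimal
sketch using only the standard library; the function name and the sample numbers are illustrative
and are not taken from the paper.

```python
import math


def rmse(forecasts, observed):
    """Root mean square error: square root of the mean squared difference F - O."""
    if len(forecasts) != len(observed):
        raise ValueError("forecasts and observed must have the same length")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: errors of 2 and 0 give a mean squared error of 2.
example_rmse = rmse([3.0, 5.0], [1.0, 5.0])
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many
small ones.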

The COVID-19 analysis for India covers 22 January 2020 to 29 May 2021. An ensemble model is
proposed in this paper that uses a weighted average of predictions. This novel ensemble model
combines the forecast of each statistical model in proportion to its weight. The models forecast
the confirmed cases in India over the next 17 days, that is, from 30 May 2021 to 15 June 2021, and
the forecasts are presented in Table 4. The predicted values are compared with the actual data,
and the accuracy of each model is assessed using (2). As Table 5 shows, the ensemble regression
model has a higher percentage of accuracy than the other regression models. According to these
findings, the ensemble model outperformed the statistical prediction models.


%Acc = 100 - \left( \frac{|\text{Actual Value} - \text{Predicted Value}|}{\text{Actual Value}} \times 100 \right)    (2)
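
The accuracy measure in (2) can be sketched as follows, using the 30-05-2021 row of Table 4 as
input. Taking the absolute value of the difference is an assumption; it is the choice that keeps
over-predictions, such as those of the polynomial model, below 100 and reproduces the magnitudes
listed in Table 5. The function name is illustrative.

```python
def percent_accuracy(actual, predicted):
    """Percentage accuracy: 100 minus the absolute percentage error."""
    return 100.0 - abs(actual - predicted) / actual * 100.0


# Actual and predicted confirmed cases for 30-05-2021, copied from Table 4.
actual_cases = 28047000
acc_polynomial = percent_accuracy(actual_cases, 40035027)
acc_svm = percent_accuracy(actual_cases, 27103831)
acc_ensemble = percent_accuracy(actual_cases, 26752384)

print(round(acc_polynomial, 2), round(acc_svm, 2), round(acc_ensemble, 2))
```

A forecast that is too high and one that is too low by the same amount receive the same score, so
this measure reports the size of the miss but not its direction.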

Figure 8. Polynomial regression model for confirmed cases

Figure 9. SVM regression model for confirmed cases

Table 3. RMSE of the models in millions

             Linear Regression    Polynomial Regression    SVM Regression
Confirmed    10.24 x 10^6         5.12 x 10^6              1.60 x 10^6
Recovered    8.14 x 10^6          0.77 x 10^6              0.85 x 10^6
Death        0.07 x 10^6          0.01 x 10^6              0.02 x 10^6

Table 4. Predicted values of the models

Sl no  Date  Actual Data  Linear Regression  Polynomial Regression  SVM Regression  Ensemble Technique
1 30-05-2021 28047000 15375267 40035027 27103831 26752384 
2 31-05-2021 28173883 15413275 41177458 27416098 27038718 
3 01-06-2021 28307035 15451283 42357143 27731587 27329594 
4 02-06-2021 28441079 15489290 43575080 28050325 27625096 
5 03-06-2021 28573503 15527298 44832289 28372339 27925313 
6 04-06-2021 28693957 15565306 46129813 28697656 28230334 
7 05-06-2021 28808445 15603313 47468717 29026303 28540249 
8 06-06-2021 28909654 15641321 48850089 29358306 28855150 
9 07-06-2021 28995458 15679329 50275040 29693694 29175134 
10 08-06-2021 29088245 15717336 51744706 30032493 29500294 
11 09-06-2021 29182128 15755344 53260244 30374732 29830729 
12 10-06-2021 29273977 15793352 54822838 30720439 30166538 
13 11-06-2021 29358551 15831359 56433696 31069641 30507823 
14 12-06-2021 29439076 15869367 58094052 31422367 30854687 
15 13-06-2021 29510077 15907375 59805163 31778645 31207236 
16 14-06-2021 29570085 15945382 61568316 32138503 31565575 
17 15-06-2021 29632302 15983390 63384821 32501971 31929815 

Table 5. Accuracy of the models (%)

Date          Linear Regression    Polynomial Regression    SVM Regression    Ensemble Method
30-05-2021    54.18                57.25                    96.63             95.39
31-05-2021    54.70                53.84                    97.31             95.98
01-06-2021    54.58                50.36                    97.96             96.55
02-06-2021    54.46                46.78                    98.62             97.14
03-06-2021    54.34                43.09                    99.29             97.74
04-06-2021    54.24                39.23                    99.98             98.39
05-06-2021    54.16                35.22                    99.24             99.07
06-06-2021    54.10                31.02                    98.44             99.82
07-06-2021    54.07                26.61                    97.59             99.38
08-06-2021    54.03                22.11                    96.75             98.58
09-06-2021    53.98                17.49                    95.91             97.77
10-06-2021    53.95                15.36                    95.07             96.95
11-06-2021    53.92                12.72                    94.17             96.08
12-06-2021    53.90                10.66                    92.31             95.19
13-06-2021    53.90                9.68                     92.31             94.24
14-06-2021    53.92                8.68                     91.31             93.25
15-06-2021    53.93                9.68                     90.31             92.24

Regression analysis is a reliable method for determining the relationship between the independent
variable, date (x), and the dependent variable, confirmed cases (y). Linear regression, polynomial
regression, and SVM regression are well-known regression methods that are easy to implement and
relatively efficient. These approaches are used to predict the confirmed cases of COVID-19.
Thereafter an ensemble model is proposed, which outperforms the existing models because its
predicted data is very close to the actual data.


5. CONCLUSION AND FUTURE SCOPE 

In this study, the spread of COVID-19 in different Indian states is discussed, and an ensemble
model employing linear regression, polynomial regression, and SVM regression is proposed and
experimentally verified for forecasting confirmed COVID-19 cases in India. All of the models are
assessed for accuracy. When the models are compared, the ensemble model provides more accurate
predicted values for time-series forecasting than the other models. According to the findings,
substantially stricter COVID-19 restrictions are needed to control the spread of the disease. The
predictions could aid healthcare decision-making, and proactive measures could be taken to reduce
the loss of human life. The proposed ensemble model can be extended to predict recoveries and
fatalities in a given location.